{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WfGpInivS0fG"
   },
   "source": [
    "<h2 align=\"center\">点击下列图标在线运行HanLP</h2>\n",
    "<div align=\"center\">\n",
    "\t<a href=\"https://colab.research.google.com/github/hankcs/HanLP/blob/doc-zh/plugins/hanlp_demo/hanlp_demo/zh/dep_mtl.ipynb\" target=\"_blank\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\t<a href=\"https://mybinder.org/v2/gh/hankcs/HanLP/doc-zh?filepath=plugins%2Fhanlp_demo%2Fhanlp_demo%2Fzh%2Fdep_mtl.ipynb\" target=\"_blank\"><img src=\"https://mybinder.org/badge_logo.svg\" alt=\"Open In Binder\"/></a>\n",
    "</div>\n",
    "\n",
    "## 安装"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IYwV-UkNNzFp"
   },
   "source": [
    "无论是Windows、Linux还是macOS，HanLP的安装只需一句话搞定："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1Uf_u7ddMhUt"
   },
   "outputs": [],
   "source": [
    "!pip install hanlp -U"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pp-1KqEOOJ4t"
   },
   "source": [
    "## 加载模型\n",
    "HanLP的工作流程是先加载模型，模型的标示符存储在`hanlp.pretrained`这个包中，按照NLP任务归类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "0tmKBu7sNAXX"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'OPEN_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH': 'https://file.hankcs.com/hanlp/mtl/open_tok_pos_ner_srl_dep_sdp_con_electra_small_20201223_035557.zip',\n",
       " 'OPEN_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH': 'https://file.hankcs.com/hanlp/mtl/open_tok_pos_ner_srl_dep_sdp_con_electra_base_20201223_201906.zip',\n",
       " 'CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH': 'https://file.hankcs.com/hanlp/mtl/close_tok_pos_ner_srl_dep_sdp_con_electra_small_20210111_124159.zip',\n",
       " 'CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH': 'https://file.hankcs.com/hanlp/mtl/close_tok_pos_ner_srl_dep_sdp_con_electra_base_20210111_124519.zip',\n",
       " 'CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ERNIE_GRAM_ZH': 'https://file.hankcs.com/hanlp/mtl/close_tok_pos_ner_srl_dep_sdp_con_ernie_gram_base_aug_20210904_145403.zip',\n",
       " 'UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_MT5_SMALL': 'https://file.hankcs.com/hanlp/mtl/ud_ontonotes_tok_pos_lem_fea_ner_srl_dep_sdp_con_mt5_small_20210228_123458.zip',\n",
       " 'UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE': 'https://file.hankcs.com/hanlp/mtl/ud_ontonotes_tok_pos_lem_fea_ner_srl_dep_sdp_con_xlm_base_20210602_211620.zip',\n",
       " 'NPCMJ_UD_KYOTO_TOK_POS_CON_BERT_BASE_CHAR_JA': 'https://file.hankcs.com/hanlp/mtl/npcmj_ud_kyoto_tok_pos_ner_dep_con_srl_bert_base_char_ja_20210914_133742.zip'}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import hanlp\n",
    "hanlp.pretrained.mtl.ALL # MTL多任务，具体任务见模型名称，语种见名称最后一个字段或相应语料库"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EmZDmLn9aGxG"
   },
   "source": [
    "调用`hanlp.load`进行加载，模型会自动下载到本地缓存。自然语言处理分为许多任务，分词只是最初级的一个。与其每个任务单独创建一个模型，不如利用HanLP的联合模型一次性完成多个任务："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "elA_UyssOut_"
   },
   "source": [
    "## 依存句法分析\n",
    "任务越少，速度越快。如指定仅执行依存句法分析："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 70
    },
    "id": "BqEmDMGGOtk3",
    "outputId": "2a0d392f-b99a-4a18-fc7f-754e2abe2e34"
   },
   "outputs": [],
   "source": [
    "doc = HanLP(['2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。', '阿婆主来到北京立方庭参观自然语义科技公司。'], tasks='dep')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "返回值为一个[Document](https://hanlp.hankcs.com/docs/api/common/document.html):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"tok/fine\": [\n",
      "    [\"2021年\", \"HanLPv2.1\", \"为\", \"生产\", \"环境\", \"带来\", \"次\", \"世代\", \"最\", \"先进\", \"的\", \"多\", \"语种\", \"NLP\", \"技术\", \"。\"],\n",
      "    [\"阿婆主\", \"来到\", \"北京\", \"立方庭\", \"参观\", \"自然\", \"语义\", \"科技\", \"公司\", \"。\"]\n",
      "  ],\n",
      "  \"dep\": [\n",
      "    [[6, \"tmod\"], [6, \"nsubj\"], [6, \"prep\"], [5, \"nn\"], [3, \"pobj\"], [0, \"root\"], [8, \"amod\"], [15, \"nn\"], [10, \"advmod\"], [15, \"rcmod\"], [10, \"assm\"], [13, \"nummod\"], [15, \"nn\"], [15, \"nn\"], [6, \"dobj\"], [6, \"punct\"]],\n",
      "    [[2, \"nsubj\"], [0, \"root\"], [4, \"nn\"], [2, \"dobj\"], [2, \"conj\"], [9, \"nn\"], [9, \"nn\"], [9, \"nn\"], [5, \"dobj\"], [2, \"punct\"]]\n",
      "  ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "print(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`doc['dep']`为句子们的依存句法树列表，第`i`个二元组表示第`i`个单词的`[中心词的下标, 与中心词的依存关系]`。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wxctCigrTKu-"
   },
   "source": [
    "可视化依存句法树："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Zo08uquCTFSk",
    "outputId": "c6077f2d-7084-4f4b-a3bc-9aa9951704ea"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dep Tree    \tToken    \tRelati\n",
      "────────────\t─────────\t──────\n",
      " ┌─────────►\t2021年    \ttmod  \n",
      " │┌────────►\tHanLPv2.1\tnsubj \n",
      " ││┌─►┌─────\t为        \tprep  \n",
      " │││  │  ┌─►\t生产       \tnn    \n",
      " │││  └─►└──\t环境       \tpobj  \n",
      "┌┼┴┴────────\t带来       \troot  \n",
      "││       ┌─►\t次        \tamod  \n",
      "││  ┌───►└──\t世代       \tnn    \n",
      "││  │    ┌─►\t最        \tadvmod\n",
      "││  │┌──►├──\t先进       \trcmod \n",
      "││  ││   └─►\t的        \tassm  \n",
      "││  ││   ┌─►\t多        \tnummod\n",
      "││  ││┌─►└──\t语种       \tnn    \n",
      "││  │││  ┌─►\tNLP      \tnn    \n",
      "│└─►└┴┴──┴──\t技术       \tdobj  \n",
      "└──────────►\t。        \tpunct \n",
      "\n",
      "Dep Tree    \tTok\tRelat\n",
      "────────────\t───\t─────\n",
      "         ┌─►\t阿婆主\tnsubj\n",
      "┌┬────┬──┴──\t来到 \troot \n",
      "││    │  ┌─►\t北京 \tnn   \n",
      "││    └─►└──\t立方庭\tdobj \n",
      "│└─►┌───────\t参观 \tconj \n",
      "│   │  ┌───►\t自然 \tnn   \n",
      "│   │  │┌──►\t语义 \tnn   \n",
      "│   │  ││┌─►\t科技 \tnn   \n",
      "│   └─►└┴┴──\t公司 \tdobj \n",
      "└──────────►\t。  \tpunct\n"
     ]
    }
   ],
   "source": [
    "doc.pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "转换为CoNLL格式："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\t2021年\t_\t_\t_\t_\t6\ttmod\t_\t_\n",
      "2\tHanLPv2.1\t_\t_\t_\t_\t6\tnsubj\t_\t_\n",
      "3\t为\t_\t_\t_\t_\t6\tprep\t_\t_\n",
      "4\t生产\t_\t_\t_\t_\t5\tnn\t_\t_\n",
      "5\t环境\t_\t_\t_\t_\t3\tpobj\t_\t_\n",
      "6\t带来\t_\t_\t_\t_\t0\troot\t_\t_\n",
      "7\t次\t_\t_\t_\t_\t8\tamod\t_\t_\n",
      "8\t世代\t_\t_\t_\t_\t15\tnn\t_\t_\n",
      "9\t最\t_\t_\t_\t_\t10\tadvmod\t_\t_\n",
      "10\t先进\t_\t_\t_\t_\t15\trcmod\t_\t_\n",
      "11\t的\t_\t_\t_\t_\t10\tassm\t_\t_\n",
      "12\t多\t_\t_\t_\t_\t13\tnummod\t_\t_\n",
      "13\t语种\t_\t_\t_\t_\t15\tnn\t_\t_\n",
      "14\tNLP\t_\t_\t_\t_\t15\tnn\t_\t_\n",
      "15\t技术\t_\t_\t_\t_\t6\tdobj\t_\t_\n",
      "16\t。\t_\t_\t_\t_\t6\tpunct\t_\t_\n",
      "\n",
      "1\t阿婆主\t_\t_\t_\t_\t2\tnsubj\t_\t_\n",
      "2\t来到\t_\t_\t_\t_\t0\troot\t_\t_\n",
      "3\t北京\t_\t_\t_\t_\t4\tnn\t_\t_\n",
      "4\t立方庭\t_\t_\t_\t_\t2\tdobj\t_\t_\n",
      "5\t参观\t_\t_\t_\t_\t2\tconj\t_\t_\n",
      "6\t自然\t_\t_\t_\t_\t9\tnn\t_\t_\n",
      "7\t语义\t_\t_\t_\t_\t9\tnn\t_\t_\n",
      "8\t科技\t_\t_\t_\t_\t9\tnn\t_\t_\n",
      "9\t公司\t_\t_\t_\t_\t5\tdobj\t_\t_\n",
      "10\t。\t_\t_\t_\t_\t2\tpunct\t_\t_\n"
     ]
    }
   ],
   "source": [
    "print(doc.to_conll())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XOsWkOqQfzlr"
   },
   "source": [
    "为已分词的句子执行依存句法分析："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 70
    },
    "id": "bLZSTbv_f3OA",
    "outputId": "111c0be9-bac6-4eee-d5bd-a972ffc34844"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dep Tree   \tToken\tRelati\n",
      "───────────\t─────\t──────\n",
      " ┌────────►\tHanLP\tnsubj \n",
      " │┌─►┌─────\t为    \tprep  \n",
      " ││  │  ┌─►\t生产   \tnn    \n",
      " ││  └─►└──\t环境   \tpobj  \n",
      "┌┼┴────────\t带来   \troot  \n",
      "││  ┌─────►\t次世代  \tnn    \n",
      "││  │   ┌─►\t最    \tadvmod\n",
      "││  │┌─►├──\t先进   \trcmod \n",
      "││  ││  └─►\t的    \tassm  \n",
      "││  ││ ┌──►\t多语种  \tnn    \n",
      "││  ││ │┌─►\tNLP  \tnn    \n",
      "│└─►└┴─┴┴──\t技术   \tdobj  \n",
      "└─────────►\t。    \tpunct \n",
      "\n",
      "Dep Tree        \tTok\tRelation \n",
      "────────────────\t───\t─────────\n",
      "          ┌─►┌──\t我  \tassmod   \n",
      "          │  └─►\t的  \tassm     \n",
      "       ┌─►└─────\t希望 \ttop      \n",
      "┌┬─────┴────────\t是  \troot     \n",
      "│└─►┌───────────\t希望 \tccomp    \n",
      "│   │     ┌─►┌──\t张晚霞\tassmod   \n",
      "│   │     │  └─►\t的  \tassm     \n",
      "│   │  ┌─►└─────\t背影 \tnsubjpass\n",
      "│   └─►└──┬─────\t被  \tccomp    \n",
      "│         │  ┌─►\t晚霞 \tnsubj    \n",
      "│         └─►└──\t映红 \tdep      \n",
      "└──────────────►\t。  \tpunct    \n"
     ]
    }
   ],
   "source": [
    "HanLP([\n",
    "    [\"HanLP\", \"为\", \"生产\", \"环境\", \"带来\", \"次世代\", \"最\", \"先进\", \"的\", \"多语种\", \"NLP\", \"技术\", \"。\"],\n",
    "    [\"我\", \"的\", \"希望\", \"是\", \"希望\", \"张晚霞\", \"的\", \"背影\", \"被\", \"晚霞\", \"映红\", \"。\"]\n",
    "  ], tasks='dep', skip_tasks='tok*').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 注意\n",
    "Native API的输入单位限定为句子，需使用[多语种分句模型](https://github.com/hankcs/HanLP/blob/master/plugins/hanlp_demo/hanlp_demo/sent_split.py)或[基于规则的分句函数](https://github.com/hankcs/HanLP/blob/master/hanlp/utils/rules.py#L19)先行分句。RESTful同时支持全文、句子、已分词的句子。除此之外，RESTful和native两种API的语义设计完全一致，用户可以无缝互换。"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "name": "dep_mtl.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
